Add support for adding intervals to dates #2031

Merged
andygrove merged 10 commits into apache:master on Jul 15, 2022

Conversation

avantgardnerio
Contributor

@avantgardnerio avantgardnerio commented Jul 8, 2022

Re #527.

Rationale for this change

Support for adding scalar intervals to scalar dates was added in DataFusion, but to support adding columns to columns, we need to move that logic down into an Arrow kernel.

What changes are included in this PR?

  • Support for adding interval types to date types
  • A change to add_dyn to allow it to accept lambda functions with two differing parameter types
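
To illustrate the second bullet, here is a minimal, standard-library-only sketch of an element-wise binary kernel whose closure takes two differing parameter types. It is not the arrow-rs implementation: the name `binary_dyn_sketch`, the use of `Option` slices in place of Arrow arrays, and the `String` error type are all assumptions made for illustration.

```rust
/// Hypothetical helper (not the arrow-rs API): applies `op` element-wise to
/// two equal-length columns whose element types may differ.
/// An output slot is null (`None`) if either input slot is null.
fn binary_dyn_sketch<A, B, O, F>(
    left: &[Option<A>],
    right: &[Option<B>],
    op: F,
) -> Result<Vec<Option<O>>, String>
where
    A: Copy,
    B: Copy,
    F: Fn(A, B) -> O,
{
    if left.len() != right.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    Ok(left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => Some(op(*l, *r)),
            _ => None,
        })
        .collect())
}

fn main() {
    // Left column holds i64 values, right column holds i32 values.
    let left: Vec<Option<i64>> = vec![Some(100), None, Some(300)];
    let right: Vec<Option<i32>> = vec![Some(1), Some(2), Some(3)];
    let out = binary_dyn_sketch(&left, &right, |l: i64, r: i32| l + i64::from(r)).unwrap();
    println!("{:?}", out); // prints [Some(101), None, Some(303)]
}
```

The point of the two type parameters `A` and `B` is that a date column and an interval column can have different native types while still sharing one kernel.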

Are there any user-facing changes?

Adding intervals to dates should work now.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 8, 2022
Member

@andygrove andygrove left a comment

LGTM. Thanks @avantgardnerio. It would be good to get a review from @alamb as well.

@alamb
Contributor

alamb commented Jul 8, 2022

Looks like there is a minor clippy error: https://github.com/apache/arrow-rs/runs/7256871633?check_suite_focus=true

@avantgardnerio
Contributor Author

Looks like there is a minor clippy error: https://github.com/apache/arrow-rs/runs/7256871633?check_suite_focus=true

I had to re-install ubuntu to get clippy to work, but the next iteration should pass.

@avantgardnerio avantgardnerio requested a review from andygrove July 8, 2022 19:57
Contributor

@alamb alamb left a comment

Thank you @avantgardnerio -- I really like where this is headed.

cc @tustvold @viirya and @HaoYang670

@avantgardnerio avantgardnerio requested a review from alamb July 8, 2022 21:01
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.

// Copied from chronoutil crate
Member

Why copy the code instead of using it as a dependency?

Contributor Author

There was a conversation about this on the equivalent DataFusion PR: apache/datafusion#2797 (comment)

I would vote to make it a dependency, but am fine either way.

Contributor

@alamb alamb Jul 9, 2022

Sorry for the contradictory feedback -- I defer to @viirya on whether to use a dependency or inline the code. Given how little actual code it is, I kind of like avoiding a new dependency, but I agree this is an opinion rather than anything driven by other considerations.

The reason I personally prefer to avoid dependencies is that they are really hard to remove (e.g. #1882), and many of these Rust crates seem to have an initial spurt of brilliance and then basically become unmaintained as the authors move on to something else. Though to be honest, chrono itself kind of falls into that category too 🤔

Contributor Author

The Chrono folks re-opened my PR, so I don't think this will be a concern for too long. It looks like they are interested in building an add_months() function into chrono; we're just debating where and how.

Member

I agree with @alamb. The copied code is small, so this is not a strong concern for me. I think we can copy the code now and consider removing it later by adding chrono as a dependency if it shows stable maintenance in the future.

@codecov-commenter

codecov-commenter commented Jul 8, 2022

Codecov Report

Merging #2031 (5ef4a1f) into master (ca1bfb8) will increase coverage by 0.07%.
The diff coverage is 95.06%.

❗ Current head 5ef4a1f differs from pull request most recent head 911b2db. Consider uploading reports for the commit 911b2db to get more accurate results

@@            Coverage Diff             @@
##           master    #2031      +/-   ##
==========================================
+ Coverage   83.54%   83.61%   +0.07%     
==========================================
  Files         222      223       +1     
  Lines       58178    58460     +282     
==========================================
+ Hits        48604    48884     +280     
- Misses       9574     9576       +2     
| Impacted Files | Coverage Δ |
|---|---|
| arrow/src/datatypes/mod.rs | 99.24% <ø> (ø) |
| parquet/src/arrow/array_reader/list_array.rs | 92.69% <0.00%> (ø) |
| arrow/src/ffi.rs | 87.52% <72.72%> (+0.34%) ⬆️ |
| parquet/src/arrow/record_reader/mod.rs | 89.17% <79.59%> (-0.19%) ⬇️ |
| arrow/src/datatypes/ffi.rs | 76.56% <92.85%> (+3.83%) ⬆️ |
| arrow/src/compute/kernels/arithmetic.rs | 93.75% <95.18%> (+0.12%) ⬆️ |
| arrow/src/datatypes/types.rs | 97.46% <98.57%> (+8.57%) ⬆️ |
| arrow/src/array/builder/generic_binary_builder.rs | 83.78% <100.00%> (+26.64%) ⬆️ |
| arrow/src/array/builder/generic_list_builder.rs | 95.09% <100.00%> (-1.61%) ⬇️ |
| arrow/src/array/builder/generic_string_builder.rs | 92.13% <100.00%> (+10.08%) ⬆️ |
| ... and 9 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -191,3 +194,166 @@ impl ArrowTimestampType for TimestampNanosecondType {
TimeUnit::Nanosecond
}
}

impl IntervalYearMonthType {
/// Creates a IntervalYearMonthType
Member

Hm, it looks like new returns the native value, not an IntervalYearMonthType?

Contributor Author

Yes, I struggled a lot with this. The distinction between these is very confusing. It technically returns an <IntervalYearMonthType as ArrowPrimitiveType>::Native, which is just an i32. I question where to even put these methods, whether they belong on this impl, and what to call them in the first place. I think the awkwardness comes from arrow not really having a row, so heretofore nothing operated on an individual value. As a result, there were also some bad tests that did not properly parse or populate these values, so I think a standard way to convert would be very helpful, hence the addition.

Contributor

I think adding the functionality for manipulating values to this impl makes sense.

Perhaps @viirya was referring to the Rust convention that Type::new() returns an instance of Type -- if we renamed this method make_value or something like that, it would be less surprising for other Rust developers.

We could also do it as a follow-on PR (before releasing arrow 19.x).

Member

Yea, makes sense to me.
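
The naming concern in this thread can be sketched as follows. The Arrow columnar format stores a YEAR_MONTH interval as a 32-bit signed count of months, so a constructor-like helper on the marker type naturally returns an `i32` rather than `Self`. The type `YearMonthSketch` and the method bodies below are illustrative assumptions, not the code merged in this PR.

```rust
/// Hypothetical stand-in for a marker type such as `IntervalYearMonthType`.
/// It is never instantiated; it only groups helpers for its native value.
struct YearMonthSketch;

impl YearMonthSketch {
    /// Builds the native value: total months packed into an `i32`.
    /// It returns `i32`, not `Self`, which is why a name other than `new`
    /// is less surprising.
    fn make_value(years: i32, months: i32) -> i32 {
        years * 12 + months
    }

    /// Splits a native value back into (years, remaining months).
    /// Uses truncating division, so negative values yield negative parts.
    fn split(value: i32) -> (i32, i32) {
        (value / 12, value % 12)
    }
}

fn main() {
    // One year and two months is stored as 14 months.
    let v = YearMonthSketch::make_value(1, 2);
    println!("{} -> {:?}", v, YearMonthSketch::split(v)); // prints 14 -> (1, 2)
}
```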

@avantgardnerio
Contributor Author

Could anyone help me with reproducing this locally, or understanding what went wrong?

==========================================================
Testing file duration
==========================================================
Traceback (most recent call last):
  File "/arrow/dev/archery/archery/integration/runner.py", line 246, in _run_ipc_test_case
    run_binaries(producer, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 286, in _produce_consume
    consumer.stream_to_file(producer_stream_path, consumer_file_path)
  File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in stream_to_file
    self.run_shell_command(cmd)
  File "/arrow/dev/archery/archery/integration/tester.py", line 49, in run_shell_command
    subprocess.check_call(cmd, shell=True)
  File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest --mode stream-to-file -a /tmp/tmpzg7b26id/29338ab4_generated_datetime.consumer_stream_as_file < /tmp/tmpzg7b26id/29338ab4_generated_datetime.producer_file_as_stream' returned non-zero exit status 1.
################# FAILURES #################
Traceback (most recent call last):
FAILED TEST: datetime Java producing,  C# consuming
  File "/arrow/dev/archery/archery/integration/runner.py", line 246, in _run_ipc_test_case
    run_binaries(producer, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 286, in _produce_consume
    consumer.stream_to_file(producer_stream_path, consumer_file_path)
  File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in stream_to_file
    self.run_shell_command(cmd)
  File "/arrow/dev/archery/archery/integration/tester.py", line 49, in run_shell_command
    subprocess.check_call(cmd, shell=True)
  File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest --mode stream-to-file -a /tmp/tmpzg7b26id/4b457ee3_generated_datetime.consumer_stream_as_file < /tmp/tmpzg7b26id/4b457ee3_generated_datetime.producer_file_as_stream' returned non-zero exit status 1.
FAILED TEST: datetime Rust producing,  C# consuming

Thanks!

@tustvold
Contributor

@avantgardnerio It's a flaky test (#1931) that should now have been fixed by apache/arrow#13573. I'll rerun the CI job and it should clear.

@avantgardnerio
Contributor Author

@avantgardnerio It's a flaky test (#1931) that should now have been fixed by apache/arrow#13573. I'll rerun the CI job and it should clear.

Looks great, thanks!

Contributor

@alamb alamb left a comment

I think this looks great -- thank you @avantgardnerio for figuring out a pattern and starting to implement this arithmetic. ❤️

I left one more comment but I also think this PR could be merged as is.

@viirya or @tustvold would you like to review again prior to merging?

NaiveDate::from_ymd(2000, 1, 1),
)]);
let b = IntervalYearMonthArray::from(vec![IntervalYearMonthType::new(1, 2)]);
let c = add_dyn(&a, &b).unwrap();
Contributor

it is so great to see add_dyn used like this ❤️
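
For readers without the full diff, the arithmetic behind the test snippet above can be sketched with the standard library alone. This is an approximation under stated assumptions: dates use a Date32-style encoding (days since 1970-01-01), the interval is a month count, and the day is clamped to the end of the target month in the style of chronoutil's month shifting. The function names are hypothetical and the expected value is computed here, not quoted from the PR's test.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date.
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (month + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + day - 1; // day of (March-based) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Inverse of `days_from_civil`: returns (year, month, day).
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097);
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153;
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    (yoe + era * 400 + i64::from(month <= 2), month, day)
}

fn days_in_month(year: i64, month: i64) -> i64 {
    match month {
        4 | 6 | 9 | 11 => 30,
        2 if (year % 4 == 0 && year % 100 != 0) || year % 400 == 0 => 29,
        2 => 28,
        _ => 31,
    }
}

/// Adds a year-month interval (total months) to a Date32-style value.
/// Assumption: the day is clamped to the last day of the target month.
fn add_year_month(date32: i32, months: i32) -> i32 {
    let (y, m, d) = civil_from_days(i64::from(date32));
    let total = y * 12 + (m - 1) + i64::from(months);
    let (ny, nm) = (total.div_euclid(12), total.rem_euclid(12) + 1);
    let nd = d.min(days_in_month(ny, nm));
    days_from_civil(ny, nm, nd) as i32
}

fn main() {
    // 2000-01-01 plus 1 year and 2 months (14 months in total).
    let start = days_from_civil(2000, 1, 1) as i32;
    let end = add_year_month(start, 14);
    println!("{:?}", civil_from_days(i64::from(end))); // prints (2001, 3, 1)
}
```

Clamping matters for end-of-month inputs: under this assumption, 2000-01-31 plus one month gives 2000-02-29 rather than an invalid date.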

@avantgardnerio
Contributor Author

renamed this method make_value

I think that is both valid, and non-trivial. I've renamed and pushed 🙂

@andygrove andygrove merged commit cb7e5b0 into apache:master Jul 15, 2022
@ursabot

ursabot commented Jul 15, 2022

Benchmark runs are scheduled for baseline = 9d8f0c9 and contender = cb7e5b0. cb7e5b0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb
Contributor

alamb commented Jul 15, 2022

🎉

@alamb
Contributor

alamb commented Aug 12, 2022

@avantgardnerio I wonder if you are tracking the "add support for adding interval columns to dates/timestamps" in datafusion somewhere?

I ask because apache/datafusion#3110 by @JasonLi-cn is starting to add support for timestamps

@avantgardnerio
Contributor Author

if you are tracking

I was not, but am now, ty!

MazterQyou pushed a commit to cube-js/arrow-rs that referenced this pull request Dec 5, 2023
MazterQyou pushed a commit to cube-js/arrow-rs that referenced this pull request Dec 8, 2023
mcheshkov pushed a commit to cube-js/arrow-rs that referenced this pull request Aug 21, 2024
Can drop this after rebase on commit cb7e5b0 "Add support for adding intervals to dates (apache#2031)", first released in 19.0.0
mcheshkov pushed a commit to cube-js/arrow-rs that referenced this pull request Aug 21, 2024
Can drop this after rebase on commit cb7e5b0 "Add support for adding intervals to dates (apache#2031)", first released in 19.0.0
mcheshkov pushed a commit to cube-js/arrow-rs that referenced this pull request Aug 21, 2024
Can drop this after rebase on commit cb7e5b0 "Add support for adding intervals to dates (apache#2031)", first released in 19.0.0
Labels
arrow Changes to the arrow crate
7 participants